Creating an Online Dictionary of Abbreviations from MEDLINE
Abstract
Design. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune.
Measurements. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database.
Results. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database.
Conclusion. We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
J Am Med Inform Assoc. 2002;9:612–620. DOI 10.1197/jamia.M1139.
JEFFREY T. CHANG, HINRICH SCHÜTZE, PHD, RUSS B. ALTMAN, MD, PHD
Affiliations of the authors: Department of Genetics, Stanford Medical Informatics, Stanford, California (JTC, RBA); Novation Biosciences, Stanford, California (HS).
This work was supported by NIH LM 06244 and GM61374, NSF DBI-9600637, and a grant from the Burroughs-Wellcome Foundation.
Correspondence and reprints: Russ B. Altman, MD, PhD, Department of Genetics, Stanford Medical Informatics, Stanford School of Medicine, Medical School Office Building, X-215, 251 Campus Dr., Stanford, CA 94305.
Received for publication: 3/30/02; accepted for publication: 6/26/02.
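The scoring and evaluation steps summarized in the abstract can be illustrated with a minimal sketch. It assumes each candidate expansion has already been turned into a numeric feature vector. The feature values, weights, and bias below are hypothetical placeholders, not the features or fitted parameters of the published system, and the function names are invented for illustration.

```python
import math


def logistic_score(features, weights, bias):
    """Map a feature vector to a score in (0, 1) with the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def precision_recall(scored, threshold, total_true):
    """Compute precision and recall for candidates scoring at or above threshold.

    scored: list of (score, is_correct) pairs for the candidates found.
    total_true: number of annotated abbreviations in the gold standard.
    """
    accepted = [is_correct for score, is_correct in scored if score >= threshold]
    true_positives = sum(accepted)
    # Convention (an assumption here): precision is 1.0 when nothing is accepted.
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_true if total_true else 0.0
    return precision, recall


# Hypothetical weights and candidates, for illustration only.
weights, bias = [2.0, -1.0], 0.0
candidates = [([1.5, 0.5], True), ([0.5, 0.0], False), ([0.0, 1.0], True)]
scored = [(logistic_score(f, weights, bias), ok) for f, ok in candidates]
print(precision_recall(scored, threshold=0.6, total_true=2))
```

Raising the threshold accepts fewer candidates, which tends to trade recall for precision. That trade-off is what a figure such as "83% recall at 80% precision" describes.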
excludes many types of abbreviations that appear in biomedical literature. Authors create abbreviations in many different ways, as summarized in Table 1.
Nevertheless, the numerous lists of abbreviations covering many domains attest to broad interest in identifying them. Opaui, a web portal for abbreviations, contains links to 152 lists.4 Some are compiled by individuals or groups.5,6 Others accept submissions from users over the internet.7,8 For the medical domain, a manually collected published dictionary contains over 10,000 entries.9
Because of the size and growth of the biomedical literature, manual compilations of abbreviations suffer from problems of completeness and timeliness. Automated methods for finding abbreviations are therefore of great potential value. In general, these methods scan text for candidate abbreviations and then apply an algorithm to match them with the surrounding text. Most abbreviation finders fall into one of three types.
The simplest type of algorithm matches an abbreviation’s letters to the initial letters of the words around it. The algorithm for recognizing this type is relatively straightforward, although it must perform some special processing to ignore common words. Taghva gives as an example the Office of Nuclear Waste Isolation (ONWI), where the O can be matched with the initial letter of either “Office” or “of.”10 More complex methods relax the first letter requirement and allow matches to other characters.
These typically use heuristics to favor matches on the first letter or syllable boundaries, upper case letters, length of acronym, and other characteristics.11 However, Yeates notes the challenge in finding optimal weights for each heuristic and further posits that machine learning approaches may help.12
Another approach recognizes that the alignment between an abbreviation and its long form often follows a set of patterns.13,14,15 Thus, a set of carefully and manually crafted rules governing allowed patterns can recognize abbreviations. Furthermore, one can control the performance of the system by adjusting the set of rules, trading off between the leniency with which a rule allows matches and the number of errors that it introduces. In their rule-based system, Pustejovsky et al. introduced an interesting innovation by including lexical information.14 Their insight is that abbreviations are often composed from noun phrases and that constraining the search to definitions in the noun phrases closest to the abbreviation will improve precision. With the search constrained, they found that they could further tune their rules to also improve recall.
Finally, there is one completely different approach to abbreviation search based on compression.16 The idea here is that a correct abbreviation gives better clues to the best compression model for the surrounding text than an incorrect one. Thus, a normalized compression ratio built from the abbreviation gives a score capable of distinguishing abbreviations.
This article presents three contributions: a novel algorithm for identifying abbreviations, a set of features descriptive of various types of abbreviations, and a publicly accessible abbreviation server containing all abbreviation definitions found in MEDLINE.
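The simplest family of methods discussed above, matching abbreviation letters to word initials while ignoring common words, can be sketched as follows. This is an illustrative sketch only, not the algorithm of Taghva or of this article. The function name, the list of common words, and the backtracking strategy are assumptions made for the example.

```python
# Hypothetical list of common words that may be skipped during matching.
COMMON_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def match_initials(abbrev, words, skippable=COMMON_WORDS):
    """Return indices of the words whose initials spell abbrev, or None.

    Each word must either supply the next abbreviation letter or be a
    skippable common word. Backtracking resolves cases where a letter
    could align with more than one word.
    """
    letters = abbrev.lower()

    def search(letter_idx, word_idx):
        if letter_idx == len(letters):
            # All letters are placed; leftover words must all be skippable.
            leftover = words[word_idx:]
            return [] if all(w.lower() in skippable for w in leftover) else None
        if word_idx == len(words):
            return None
        word = words[word_idx].lower()
        # Option 1: this word's initial supplies the next letter.
        if word and word[0] == letters[letter_idx]:
            rest = search(letter_idx + 1, word_idx + 1)
            if rest is not None:
                return [word_idx] + rest
        # Option 2: skip the word if it is a common word.
        if word in skippable:
            return search(letter_idx, word_idx + 1)
        return None

    return search(0, 0)


phrase = "Office of Nuclear Waste Isolation".split()
print(match_initials("ONWI", phrase))
```

In the ONWI example, the O aligns with "Office" and "of" is skipped as a common word. The more complex methods described above would instead score several possible alignments using weighted heuristics or hand-written rules.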